Abstract

Fast and robust hand segmentation and tracking is an essential basis for gesture 
recognition and thus an important component for contact-less human-computer inter- 
action (HCI). Hand gesture recognition based on 2D video data has been intensively 
investigated. However, in practical scenarios purely intensity based approaches suffer 
from uncontrollable environmental conditions like cluttered background colors. 
In this paper we present a real-time hand segmentation and tracking algorithm us- 
ing Time-of-Flight (ToF) range cameras and intensity data. The intensity and range 
information is fused into one pixel value, representing its combined intensity-depth 
homogeneity. The scene is hierarchically clustered using a GPU based parallel merg- 
ing algorithm, allowing a robust identification of both hands even for inhomogeneous 
backgrounds. After the detection, both hands are tracked on the CPU. Our tracking 
algorithm can cope with the situation that one hand is temporarily covered by the other 
hand. 

1. Introduction 

Gesture-based real-time human-computer interaction requires a fast and robust segmentation of
the human hands [ ]. Classical approaches are based on 2D intensity or color images. However,
techniques of this kind suffer from low efficiency and a lack of robustness in cluttered
scenes or under varying lighting conditions. Practical application scenarios strongly require
techniques capable of handling both effects; frequently applied simplifications, e.g.
restricted lighting or material conditions [9], or marker- or glove-based approaches [16],
are hardly applicable.

One major approach to overcome the problems of segmenting intensity or color image sequences
for gesture recognition purposes is to use additional depth information, delivered by laser range
systems [ ], stereo cameras [ ] or structured light range acquisition systems [ ]. The major
drawback of all these approaches is the comparatively expensive sensing hardware and the
significant space requirement, which is due to systematic constraints, e.g. the baseline required
for stereo techniques or structured light, or mechanical setups in the case of laser range scanners.

Recently, Time-of-Flight (ToF) technology, which measures the time that light emitted by
an illumination unit requires to travel to an object and back to a detector, has become available
in the form of highly integrated ToF cameras. Unlike other 3D acquisition systems, ToF cameras are
very compact. They are realized in standard CMOS or CCD technology and can thus be
manufactured cost-efficiently [10, 20]. ToF cameras have been successfully applied in the context
of man-machine interaction, e.g. for facial tracking [ ], for touch-free navigation in 3D medical
applications [15], upper-body gesture [ ] and hand gesture recognition [4].

In this paper, we introduce a hand segmentation approach based on a hierarchical clustering
technique. Using hierarchical clustering is beneficial, since the final number of clusters that
delivers the "best" hand segmentation depends strongly on the scene complexity and thus
cannot be determined beforehand. To achieve a high performance, we adopt the GPU-based
clustering approach introduced by Chiosa and Kolb [3] to cluster fused range-intensity images.
In this context, we introduce a novel homogeneity criterion for range-intensity images. Our hand
segmentation and tracking system is capable of robustly detecting and tracking both hands, even
when one hand is temporarily covered by the other hand or when a distracting
object, e.g. a third hand, appears.

The remainder of the paper is structured as follows. In Sec. 2 we discuss major prior work on
segmentation and tracking techniques. Sec. 3 gives an overview of our hand tracking system,
followed by detailed discussions of the applied hierarchical clustering method (Sec. 4) and the
hand tracking (Sec. 5). Sec. 6 presents some experimental results of our tracking approach and
in Sec. 7 we draw some final conclusions.

2. Related Work 

2.1. Segmentation 

The most important basis for a hand tracking system is a robust and fast clustering algorithm in
order to segment the hands from their environment. Graph cut algorithms, or graph-based
clustering algorithms [ ], used by Schoenberg et al. [14], deliver good clustering results. However,
with a frame rate of approximately 3-4 FPS (frames per second) for a $204^2$ pixel image,
this method is unsuitable for real-time applications. Another, simpler approach is used by
Breuer et al. [2], where only the depth information is used to segment one hand from the
background. By depth keying the background, only the nearest 3D point cloud is detected as hand.
This method, however, is too inflexible because the hand is restricted to a certain spatial position.
Furthermore, a hand segmentation based on range data only is too error-prone. Holte et al. [8]
present a motion detection approach, where only the moving arms are segmented. In order to
extract the moving arms, Holte et al. use 3D double difference images and represent this data by
their shape context scheme. Tsap [ ] and Tsap and Shin [ ] use a connected component analysis
of skin-colored pixels to segment the hands and the face. Only clusters with a certain shape
are selected as possible hand clusters. In [ ] Ghobadi et al. propose a segmentation technique
which combines two clustering approaches to cluster fused range-intensity images: k-means and
expectation maximization. The drawback of this method is the necessity to decide on the number
of clusters beforehand.

2.2. Tracking 

Tsap and Shin [18] use a simple depth analysis to distinguish the hand from the face, defining 
the skin colored cluster nearest to the camera as the hand. Furthermore they limit the depth 
space search for the hand dynamically by using the motion history of the last frames. In [17], 
Tsap tracks only non-static skin-colored objects by applying a set of filters on color, motion, and 
range data. After segmentation, Breuer et al. [ ] apply a two-stage approach to detect the hand 
based on principal component analysis and a refinement based on a model-fitting in object space. 
Holte et al. [8] use the shape context (see Sec. 2.1) of the segmented moving arms for gesture 
recognition. A gesture is recognized by matching the current harmonic shape context with a known 
set, one for each possible gesture. Haker et al. [6] use a simple classifier to detect the human 
nose in fused intensity-range images for facial tracking. Soutschek et al. [ ] use an approach to 
detect the hand as the closest object with regard to the ToF-camera. They combine thresholding 
applied to depth and amplitude information and an explicit palm detection in order to remove the 
forearm geometry. In [ ], Ghobadi et al. apply their k-means clustering technique proposed in [5] to
interactively track the hand on a frame-to-frame basis using Haar-like features in combination with
AdaBoost.

3. System Overview

The clustering and hand tracking algorithm is shown in Fig. 1. In summary, our
system consists of two major components: the hierarchical clustering and the tracking component.

Hierarchical Clustering. After receiving the range and intensity information from the
camera, the GPU-based parallel merging algorithm clusters the fused range-intensity data into
regions (see Sec. 4.1). For every region, the algorithm searches for the optimal merge partner
according to a homogeneity criterion. If two regions select each other as optimal merge
partners, they are merged into one region, and the region size and data are updated. The
clustering stops when no optimal merge partners are left (see Sec. 4.2).

Tracking. The tracking algorithm analyzes the clustering result. If the system is still in
the initialization phase, the two clusters nearest to the camera that have a given minimum size
are assigned as the first hand and the second hand (see Sec. 5.1). After the initialization,
two new hand clusters are chosen from the current set of clusters by comparing their homogeneity
values to those of the hand clusters from the last frame. The hands are assigned through a nearest
neighbor method (see Sec. 5.2). If only one cluster is found, we assume that one hand is covered by
the other, and the lost hand is marked. The lost hand is searched for in the subsequent frames and
is only marked as found if a cluster satisfies specific homogeneity and space criteria (see
Sec. 5.3).

4. Segmentation (Clustering) 

A hierarchical clustering algorithm, implemented on the GPU as a parallel merging algorithm,
is used to segment the hands from their environment. The advantage of a hierarchical
approach is that the number of clusters, and thus the size and shape of the
hand clusters, results only from the applied homogeneity criterion. This means that no predefined
number of clusters is necessary.

The parallel clustering method (see Sec. 4.1) and the merging criterion (see Sec. 4.2), which
is used to merge pixels and clusters into new clusters, are discussed in the following subsections.

4.1. Parallel Merging Algorithm 

The Parallel Merging Algorithm is a hierarchical clustering method [ ], where initially each pixel
represents a region (cluster) with its own region ID and region characteristics.

For merging two regions $\mathcal{R}_1, \mathcal{R}_2$, they have to satisfy a given boolean criterion based on a
homogeneity descriptor $w^i = (w^i_1, \dots, w^i_N)$, $i = 1, 2$, consisting of $N$ components, i.e.

$$ |w^1_l - w^2_l| < t_l, \quad l = 1, \dots, N. \tag{1} $$

Based on the individual homogeneity components $w_l$, the homogeneity difference between
two regions is deduced:

$$ f(\mathcal{R}_1, \mathcal{R}_2) = \sum_{l=1}^{N} \alpha_l \, |w^1_l - w^2_l|, \tag{2} $$

where $\alpha_l$ is the weight of the individual homogeneity components.
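
As an illustration of Eqs. (1) and (2), consider a descriptor with $N = 2$ components. The descriptor values and thresholds below are hypothetical numbers chosen only for this example; the weights are left symbolic:

```latex
\begin{align*}
w^1 &= (0.50,\ 0.300), \qquad w^2 = (0.52,\ 0.305), \qquad t = (0.04,\ 0.009),\\
|w^1_1 - w^2_1| &= 0.02 < t_1, \qquad |w^1_2 - w^2_2| = 0.005 < t_2
  \quad\Rightarrow\quad \text{Eq. (1) is satisfied},\\
f(\mathcal{R}_1, \mathcal{R}_2) &= \alpha_1 \cdot 0.02 + \alpha_2 \cdot 0.005 .
\end{align*}
```

Both components stay below their thresholds, so the two regions are merge candidates, and the value of $f$ only decides which of several candidates is preferred.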
Given a specific cluster $\mathcal{R}$, all neighboring clusters $\mathcal{S}$ that satisfy the homogeneity criterion are
identified and the optimal merge partner $\mathcal{R}^*$ is selected, i.e.

$$ \mathcal{R}^* = \arg\min_{\mathcal{S}} \{ f(\mathcal{R}, \mathcal{S}) \}. \tag{3} $$

After identifying the optimal merge partner $\mathcal{R}^*$ for $\mathcal{R}$, both regions are merged into a single
region and the homogeneity measure for the merged region is determined. The new merged region
gets the greater region ID of both merge partners and its characteristics are updated. However,
with a sequential merging algorithm, the merging result depends on the merging order (see Fig. 2).
A better solution is therefore to determine the optimal merge partner for every region. Two regions
are merged only if the optimal merge partner choice is mutual, i.e. if $(\mathcal{R}^*)^* = \mathcal{R}$.

The merging of regions is performed in parallel and thus the following rules must be obeyed in
order to obtain a correct result [19]:

1. Each region can only merge with one other region at a time; that being the neighbor which
best satisfies the merging criteria.

2. In case of an equal measure the neighbor with the larger ID is selected in order to prevent
cyclic dependencies.

3. A merge choice must be mutual in order for two regions to merge.

These rules restrict a region to only merge with a single neighbor during any merging iteration.
The resulting region could otherwise be in violation of the homogeneity requirements (see Fig. 3).
By giving the new region the greater ID (in combination with rule number 2), possible deadlocks
can be avoided. After merging two regions into a single region, its characteristics are updated.

A region that was unable to merge during a given iteration because its selection was not mutual
may succeed in a subsequent iteration (see Fig. 3).

This iteration is repeated until there is no optimal merge partner left for any region, i.e. until
all pairs of neighbors violate the merge criterion (see Eq. (1)).

4.2. Merging Criterion

The merging criterion, which is used to check the homogeneity between neighboring regions, is
based on fusing intensity and range information into a single value, similar to Ghobadi et al. [5].
The chosen homogeneity descriptor $w$ (see Sec. 4.1), which characterizes a region, consists of two
components:

$$ w_{\mathcal{R}} = (z_{\mathcal{R}}, \phi_{\mathcal{R}}), \tag{4} $$

where $z_{\mathcal{R}}$ denotes the average Z-distance of the region with regard to the camera's optical axis
and $\phi_{\mathcal{R}}$ the fused intensity and range value.

The characterizing entities for each region $\mathcal{R}$, i.e. initially a single pixel, are given by the distance
$d_{\mathcal{R}}$ in spherical sensor coordinates and the reflected active light $I_{\mathcal{R}}$ of a region/pixel.

As the desired $\phi_{\mathcal{R}}$ needs to represent homogeneous characteristics of a pixel region in a robust
way, we apply the well-known dependency rule between intensity and distance to derive a
robust measure:

$$ I_{\mathcal{R}} \sim \frac{1}{d_{\mathcal{R}}^2}. $$

Thus, $d_{\mathcal{R}} \sqrt{I_{\mathcal{R}}}$ is a nearly constant measure for every individual surface and, interpreting each
$(d_{\mathcal{R}}, I_{\mathcal{R}})$-pair as a point in the Euclidean plane, we derive $\phi_{\mathcal{R}}$ as follows:



(5)

When merging regions, the mean of the $z_{\mathcal{R}}$ and $\phi_{\mathcal{R}}$ values of both regions is used for the new
region's homogeneity descriptor. The values of the thresholds ($t_z$ and $t_\phi$) for the merging criterion,
as well as the weights ($\alpha_z$ and $\alpha_\phi$) needed for the homogeneity difference, are listed in Table 1.
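
The statement that the product of distance and the square root of intensity stays nearly constant for one surface can be checked numerically under an idealized inverse-square model. The following Python sketch assumes $I = k/d^2$ with a hypothetical surface constant `k`; the function names and the value of `k_skin` are illustrative only and not part of the described system:

```python
import math


def simulated_intensity(d, k):
    """Idealized inverse-square model I = k / d^2 (k: hypothetical surface constant)."""
    return k / d ** 2


def fused_constant(d, intensity):
    """Return d * sqrt(I), which is constant per surface if I ~ 1/d^2 holds exactly."""
    return d * math.sqrt(intensity)


k_skin = 0.36  # hypothetical value, not a measured reflectivity
for d in (0.5, 1.0, 2.0):
    value = fused_constant(d, simulated_intensity(d, k_skin))
    print(f"d = {d:.1f} m -> d * sqrt(I) = {value:.3f}")
```

Under this model the printed value is the same at every distance (the square root of `k`), which is why a measure built on this product separates surfaces rather than distances. Real ToF data will deviate from the ideal model because of noise and viewing angle.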

The hierarchical merging is applied to each frame delivered by the ToF-camera. The algorithm is
realized using a GPU implementation similar to Chiosa and Kolb [ ], which has been introduced for
mesh and data clustering. We simply use a standard 4-neighborhood to define pixel neighbors
and integrate our homogeneity measure as the merge criterion.
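
To make the interplay of Eqs. (1)-(3) and the three merging rules concrete, the following Python sketch runs the mutual-merge iteration sequentially on the CPU for a small region graph. It is not the GPU implementation referred to above: the `Region` class, all function names, the example values and the use of a plain (unweighted) mean for the merged descriptor are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    w: tuple        # homogeneity descriptor (w_1, ..., w_N)
    neighbors: set  # IDs of adjacent regions
    members: set    # IDs of the initial regions merged into this one


def satisfies(w1, w2, t):
    """Boolean merge criterion, Eq. (1): every component differs by less than its threshold."""
    return all(abs(a - b) < tl for a, b, tl in zip(w1, w2, t))


def difference(w1, w2, alpha):
    """Weighted homogeneity difference, Eq. (2)."""
    return sum(al * abs(a - b) for a, b, al in zip(w1, w2, alpha))


def best_partner(rid, regions, t, alpha):
    """Optimal merge partner, Eq. (3); ties go to the larger ID (rule 2)."""
    r = regions[rid]
    candidates = [n for n in r.neighbors if satisfies(r.w, regions[n].w, t)]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (difference(r.w, regions[n].w, alpha), -n))


def merge_iteration(regions, t, alpha):
    """One merging iteration; returns True if at least one pair was merged."""
    # All choices are made from the same state, mimicking the parallel step.
    choice = {rid: best_partner(rid, regions, t, alpha) for rid in regions}
    merged = False
    for a, b in choice.items():
        # Handle each mutual pair once, with a < b so that b (the greater ID) survives.
        if b is None or b < a or choice[b] != a:
            continue
        ra, rb = regions.pop(a), regions[b]
        rb.w = tuple((x + y) / 2 for x, y in zip(ra.w, rb.w))  # assumed plain mean
        rb.members |= ra.members
        rb.neighbors = (ra.neighbors | rb.neighbors) - {a, b}
        for n in rb.neighbors:  # redirect adjacency from a to b
            regions[n].neighbors.discard(a)
            regions[n].neighbors.add(b)
        merged = True
    return merged


def cluster(regions, t, alpha):
    """Repeat merging until no mutual merge partner is left."""
    while merge_iteration(regions, t, alpha):
        pass
    return regions


# Chain of four regions 1-2-3-4 with scalar descriptors (hypothetical values).
regions = {
    1: Region((10.0,), {2}, {1}),
    2: Region((20.0,), {1, 3}, {2}),
    3: Region((45.0,), {2, 4}, {3}),
    4: Region((100.0,), {3}, {4}),
}
cluster(regions, t=(40.0,), alpha=(1.0,))
for rid, region in sorted(regions.items()):
    print(rid, region.w, sorted(region.members))
```

In this example, regions 1 and 2 choose each other in the first iteration and merge under ID 2, while region 3 has to wait because its choice (region 2) is not mutual. In the second iteration the merged region and region 3 choose each other, and region 4 stays separate because it never satisfies Eq. (1).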

5. Hand Tracking 

The hand tracking algorithm has been implemented on the CPU and is divided into two steps. Step 
one is the initialization, where the hands will be identified. In step two, both hands will be tracked. 

5.1. Initialization 

In step one, the hands are detected from the set of regions that has been clustered until
convergence, i.e. until no further pair of neighboring clusters fulfills the merge criterion (see Eq. (1)).
The two clusters which are identified as hands are determined by the following simple rules:

• closest to the ToF-camera and

• the cluster size is above a given threshold.

Both regions are then assigned as first hand (cluster $\mathcal{R}_1$) and second hand (cluster $\mathcal{R}_2$). As
long as these regions satisfy the mentioned criteria, both hands can be clearly assigned in the
successive image through a nearest neighbor assignment (see Fig. 4).

After 30 frames, the initialization stops and the tracking is initiated. The following information about
the hand clusters is preserved from the initialization:

• $ID_1$ and $ID_2$: Region IDs of both hand clusters,

• $\phi_1$ and $\phi_2$: Homogeneity measure of both hand clusters,

• $pos_1$ and $pos_2$: Cluster center of the first and the second hand in 3D, with regard to the
camera's optical center.

All these values are required in the tracking phase. 
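
The two selection rules of the initialization can be sketched for a single frame as follows. The dictionary keys (`id`, `size`, `distance`), the function name and the example numbers are assumptions for illustration; the sketch does not model the 30-frame initialization period.

```python
def pick_initial_hands(clusters, min_size):
    """Return the two sufficiently large clusters nearest to the camera, or None."""
    large = [c for c in clusters if c["size"] > min_size]
    if len(large) < 2:
        return None
    first, second = sorted(large, key=lambda c: c["distance"])[:2]
    return first, second


# Hypothetical clustering result of one frame (size in pixels, distance in meters).
frame_clusters = [
    {"id": 1, "size": 500, "distance": 0.9},
    {"id": 2, "size": 300, "distance": 0.6},
    {"id": 3, "size": 50, "distance": 0.4},   # nearest, but too small to be a hand
    {"id": 4, "size": 800, "distance": 1.5},  # large, but far away (e.g. background)
]
hands = pick_initial_hands(frame_clusters, min_size=200)
print([h["id"] for h in hands])
```

The size threshold is what keeps small, near clusters such as noise blobs from being picked as hands.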

5.2. Frame-to-Frame Tracking 

In step two, the camera range is limited to the hands by a minimum range $r_{min}$ and a maximum
range $r_{max}$ (see Fig. 5). Thus, pixels outside the range $[r_{min}, r_{max}]$ are ignored for clustering,
resulting in a speed-up of the clustering process.

$r_{min}$ and $r_{max}$ are recalculated for every frame as follows:

$$ r_{min} = \min(d_1, d_2) - r_{th}, \qquad r_{max} = \max(d_1, d_2) + r_{th}, $$

where $d_1, d_2$ are the average distances of the first and the second hand and $r_{th}$ is a distance
threshold. $r_{th}$ is needed so that the user can move their hands freely back and forth; it prevents the
hands from being clipped away in the next frame (see Fig. 6).

After setting the minimum and maximum range, two new hand clusters are searched for in the next
frame. All clusters whose size is above a given minimum size $size_{min}$ are compared to the
hand clusters from the last frame, based on their homogeneity measure $\phi$. From the set of possible
hand clusters, two clusters are then assigned as the new hands through a nearest neighbor
method with regard to the tracking information $(ID, \phi, pos)$ from the hand clusters identified in the
last frame.
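
The text does not spell out how the nearest neighbor method combines $ID$, $\phi$ and $pos$. The following Python sketch is therefore only one possible reading: it keeps clusters larger than `size_min`, rejects pairings whose $\phi$ deviates by more than a hypothetical tolerance `phi_tol`, and picks the pairing with the smallest summed 3D distance to the previous hand positions. All names, keys and numbers are assumptions.

```python
import math
from itertools import permutations


def assign_hands(prev_hands, clusters, size_min, phi_tol):
    """Assign two distinct clusters to the two previous hands, or return None."""
    candidates = [c for c in clusters if c["size"] > size_min]
    best, best_cost = None, math.inf
    for pair in permutations(candidates, 2):
        # Skip pairings whose homogeneity measure deviates too much from the last frame.
        if any(abs(c["phi"] - h["phi"]) > phi_tol for c, h in zip(pair, prev_hands)):
            continue
        # Assumed cost: summed Euclidean distance between cluster centers and old positions.
        cost = sum(math.dist(c["pos"], h["pos"]) for c, h in zip(pair, prev_hands))
        if cost < best_cost:
            best, best_cost = pair, cost
    return best


prev_hands = [
    {"phi": 0.300, "pos": (0.0, 0.0, 0.6)},
    {"phi": 0.310, "pos": (0.3, 0.0, 0.6)},
]
clusters = [
    {"id": 7, "size": 400, "phi": 0.302, "pos": (0.28, 0.01, 0.62)},
    {"id": 9, "size": 350, "phi": 0.301, "pos": (0.02, 0.0, 0.61)},
    {"id": 4, "size": 50, "phi": 0.300, "pos": (0.0, 0.0, 0.6)},   # too small
    {"id": 5, "size": 900, "phi": 0.600, "pos": (0.1, 0.0, 0.6)},  # different surface
]
new_hands = assign_hands(prev_hands, clusters, size_min=200, phi_tol=0.02)
print([c["id"] for c in new_hands])
```

Evaluating both hands jointly guarantees that the two hands are never assigned to the same cluster.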

5.3. Mutual Occlusion of Hands 

The tracking method described in Sec. 5.2 works robustly as long as two hands are clearly visible
and distinguishable in the ToF images. Crossing hands would, however, lead to problems, because
one hand would be temporarily covered by the other hand while the tracking algorithm would still
search for the second hand.

Therefore, to allow one hand to be temporarily covered by the other hand, the following control
method has been added to the tracking algorithm.

Right after the hands have been detected in the initialization step (see Sec. 5.1), the algorithm
checks the XY-distance between both hands in every frame. Depending on this distance, it is
decided whether both hands are in a critical area $D$, where the possibility of one hand covering the
other hand in one of the subsequent frames is high. $D$ is defined as follows:

$$ D = \begin{cases} \text{true}, & \left\| (pos_1)_{xy} - (pos_2)_{xy} \right\| < d_{min} \\ \text{false}, & \text{else,} \end{cases} $$

where $(pos_i)_{xy}$ is the XY-position of the $i$-th hand and $d_{min}$ is the minimal distance.

As long as the distance between both hands is above the threshold $d_{min}$, the hands are tracked
as described in Sec. 5.2. If $D = \text{true}$, the following procedure is initiated:

Check for possible disappearance of a single hand. A hand is considered missing if,
in the current frame:

• no cluster has been found which can resemble the hand (based on $\phi$), or

• a cluster has been found, but the distance between its cluster center and the hand position of
the last frame is above a given maximal distance, implying that no cluster can be assigned
to the hand in a meaningful way.

Detection and tracking of the uncovered hand. After a hand gets covered by the other,
the last resolved tracking information of the disappeared hand cluster (back hand cluster $\mathcal{R}_{bh}$) is
stored in order to detect the lost hand again after reappearance. The front hand cluster $\mathcal{R}_{fh}$,
however, is continuously tracked as described in Sec. 5.2. The lost hand is searched for in
the set of remaining clusters. In order to reassign the lost hand to a cluster $\mathcal{R}$, the following criteria
have to be satisfied:

• The cluster center must be behind the front hand cluster center in Z-direction, i.e.
$(pos_{\mathcal{R}})_z < (pos_{\mathcal{R}_{fh}})_z$.

• The distance between the cluster center $pos_{\mathcal{R}}$ and the center of the front hand cluster
$pos_{\mathcal{R}_{fh}}$ must not exceed a threshold $t_d$: $\left\| pos_{\mathcal{R}} - pos_{\mathcal{R}_{fh}} \right\| < t_d$.
Thus the lost hand appears beside the hiding hand.

• The difference between the homogeneity measure of the cluster $\mathcal{R}$ and the stored measure
of the back hand cluster must not exceed a threshold $t_\phi$: $| \phi_{\mathcal{R}_{bh}} - \phi_{\mathcal{R}} | < t_\phi$.

If more than one cluster satisfies these criteria, the new uncovered hand cluster is assigned to
the one with minimal $\phi$ deviation. If none of the clusters satisfies these criteria, the search for the
lost hand continues in the next frame.

Parameter        Value
$t_z$            0.04 m
$t_\phi$
$\alpha_z$       $8/\pi$
$\alpha_\phi$    $4/3$
$size_{min}$     200 px
$t_d$            0.1 m
$t_\phi$         0.009 (rad)
$r_{th}$         0.1 m
$d_{min}$        0.1 m

Table 1: Parameters used for the clustering (first four rows) and hand tracking algorithm (rest).

Process              time (Init)    time (Tracking)
Update Camera        16             16
Find Mergepartner    29             27
Merge Regions        5              3
Update Values        10             4
Tracking             7              9
Total Time           67             59

Table 2: Average runtime (in msec) on an Intel Core 2 Duo PC with 3.00 GHz, 4 GB RAM and an
NVIDIA GeForce GTX480 graphics card, for $204^2$ px ToF data.
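
The occlusion handling described in this section can be sketched in Python as follows. Positions are `(x, y, z)` tuples; the function names, dictionary keys and example values are assumptions for illustration, and the Z inequality is implemented exactly as stated in the first criterion above.

```python
import math


def in_critical_area(pos1, pos2, d_min):
    """D: True if the XY-distance between both hands is below d_min."""
    return math.dist(pos1[:2], pos2[:2]) < d_min


def is_missing(candidate, last_pos, max_dist):
    """A hand is missing if no cluster resembles it or the best one is too far away."""
    return candidate is None or math.dist(candidate["pos"], last_pos) > max_dist


def find_lost_hand(clusters, front, back_phi, t_d, t_phi):
    """Return the remaining cluster that best matches the stored back hand, or None."""
    matches = [
        c for c in clusters
        if c["pos"][2] < front["pos"][2]             # Z criterion as stated in the text
        and math.dist(c["pos"], front["pos"]) < t_d  # appears beside the front hand
        and abs(back_phi - c["phi"]) < t_phi         # resembles the stored back hand
    ]
    # Several matches: take the one with the smallest deviation in phi.
    return min(matches, key=lambda c: abs(back_phi - c["phi"]), default=None)


front_hand = {"id": 2, "phi": 0.305, "pos": (0.0, 0.0, 0.60)}
remaining = [
    {"id": 11, "phi": 0.303, "pos": (0.05, 0.0, 0.55)},
    {"id": 12, "phi": 0.301, "pos": (0.05, 0.0, 0.58)},
    {"id": 13, "phi": 0.300, "pos": (0.50, 0.0, 0.50)},  # too far from the front hand
]
found = find_lost_hand(remaining, front_hand, back_phi=0.300, t_d=0.1, t_phi=0.009)
print(found["id"])
```

Storing only the last resolved $\phi$ of the back hand is sufficient for this check, because the position constraints are expressed relative to the front hand, which is still being tracked.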



6. Results 

After both hands have been detected in the initialization, the clustering and tracking algorithm runs
with an average frame rate of 16 frames/sec (see Table 2). However, some restrictions must be
satisfied to ensure flawless hand tracking. Objects or body parts with a reflectivity similar to
skin must not get in contact with a hand, i.e. touch the hand at approximately the same Z-distance
to the ToF-camera; otherwise they are assigned to one hand cluster (see Fig. 7).

A similar error occurs when crossing both hands. Since both hands usually have
the exact same homogeneity measure $\phi$, not keeping a given minimum distance in Z-direction
while crossing can lead to a merging of both hands into a single hand cluster (see Fig. 8). These
clustering issues can furthermore lead to tracking errors (see Fig. 8(c)).

However, as long as the mentioned restrictions are satisfied, the tracking algorithm can handle a 
third hand and will always track the correct two visible hands (see Fig. 9). 

As mentioned in Sec. 4.2, the homogeneity measure $\phi$ is based on the idea of Ghobadi et al. [5].
The main difference between both measures is that Ghobadi et al. chose a linear approach to
describe the relation between distance and intensity of a given cluster $\mathcal{R}$:



where $d_{\mathcal{R}}$ is the normalized distance and $I_{\mathcal{R}}$ the normalized intensity of the cluster. This approach
can lead to inaccurate clustering results (see Fig. 10). With the more accurate calculation of $\phi$ from
Eq. (5), those clustering errors are significantly reduced.

All parameters and the values that were applied are listed in Table 1.

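
The frame rate quoted at the start of this section can be cross-checked against the per-stage runtimes of Table 2. The short Python sketch below only redoes that arithmetic; the variable and function names are illustrative:

```python
# Average runtimes per stage in milliseconds, taken from Table 2.
init_ms = {"Update Camera": 16, "Find Mergepartner": 29, "Merge Regions": 5,
           "Update Values": 10, "Tracking": 7}
tracking_ms = {"Update Camera": 16, "Find Mergepartner": 27, "Merge Regions": 3,
               "Update Values": 4, "Tracking": 9}


def frames_per_second(stage_times_ms):
    """Frame rate implied by the summed per-frame runtime."""
    return 1000.0 / sum(stage_times_ms.values())


print(sum(init_ms.values()), round(frames_per_second(init_ms), 1))
print(sum(tracking_ms.values()), round(frames_per_second(tracking_ms), 1))
```

The stage times add up to the listed totals of 67 ms and 59 ms, which correspond to roughly 15 and 17 frames/sec for initialization and tracking, in line with the reported average of about 16 frames/sec.
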
7. Conclusion 

In this paper, we introduce a fast and robust hand segmentation approach based on a hierarchical
clustering technique, which is implemented on the GPU as a parallel merging algorithm in order
to achieve a high performance. We also introduce a novel homogeneity criterion for robustly merging
regions in fused range-intensity images. The intensity and range information is fused into one pixel
value, representing its combined intensity-depth homogeneity.

The results show that when certain restrictions are satisfied, our hand segmentation and tracking
system is capable of robustly detecting and tracking both hands in real-time, even under the
condition that one hand is temporarily covered by the other hand.

Figure 1: Overview of the clustering and hand tracking algorithm.

Figure 2: (a) A clustering into four regions, with their ID and homogeneity measure w. Threshold
t = 51. (b) Merging result for sequential merging, starting merging from top left to bottom right. (c)
Merging result for sequential merging, starting merging from top right to bottom left.

Figure 3: Threshold t = 40. (a) Mutual merging decision between region 1 and 2. Region 3 and 4
have to wait until the next iteration. (b) Merging result from (a). Mutual merging decision between
region 3 and 1. Region 4, however, does not satisfy the merging criteria anymore. (c) Final merging
result.

Figure 4: As long as both hands satisfy the criteria from the initialization step, they can always be
assigned correctly. The white and the gray clusters represent the first and the second hand,
respectively (left: clustered image; right: intensity image).

Figure 6: Hand tracking through nearest neighbor assignment. Back and forth movement of the
hands.

Figure 7: User holds both hands up. Hand palms and arms are on the same XY-plane. The grey
and the white clusters are the hand clusters. (a) The user wears a skin-colored sweater. The arms and
hands get clustered together. (b) The user wears a brown leather jacket. Arms and hands are clustered
separately.

Figure 8: (a)-(c) show a crossing sequence of both hands, where the hands are also close together
in Z-direction. (b) Since both hands get in contact while crossing each other, a part of the
back hand gets clustered to the front hand cluster, leading to tracking errors (as seen in (c)). (c)
Because the center of the back hand was shifted, the wrist gets mistaken as the new hand cluster.

Figure 9: The user's visible hands are tracked correctly, although a third hand is behind (a), in
between (b), or in front (c) of the user's tracked hands.

Figure 10: (a) Clustering result with $\phi$ from Ghobadi et al., Eq. (6). Both hands get clustered into one
hand cluster (white cluster) by a connection via the sleeve. The other sleeve then gets mistaken
as the second hand cluster (grey cluster). (b) Clustering result with $\phi$ from Eq. (5) as proposed in
this paper. Both hands get clustered correctly.



